Deep Neural Networks (DNNs)

Deep Neural Networks (DNNs), cont’d

Training DNNs

Training DNNs, cont’d

playground.tensorflow.org

Note: playground.tensorflow.org is an educational tool. It does not actually use the TensorFlow library, nor can you use it to train with your data.

Underfitting (high bias)

Symptoms:

Possible treatments:

Overfitting (high variance)

Symptoms:

Possible treatments:

Regularization

playground.tensorflow.org


Nonlinear regression

Load data

Build the model

Define the structure of the DNN. Here, we define two hidden layers, with 5 neurons in each layer.

We also specify the activation function here. The relu function is commonly used, but you can use others (see Wikipedia for examples):

sigmoid, softplus, tanh, etc.

Note that no activation is used on the final layer.

Experiment with the hidden units and activation function.
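The structure described above can be sketched in Keras as follows. The layer sizes and ReLU activation come from the slide; the one-dimensional input shape is an assumption for this regression example.

```python
import tensorflow as tf

# Two hidden layers with 5 neurons each, ReLU activation.
# No activation on the final layer (linear output for regression).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(5, activation="relu"),
    tf.keras.layers.Dense(1),  # linear output
])
```

Swapping `"relu"` for `"sigmoid"`, `"softplus"`, or `"tanh"` is a one-word change, which makes experimenting with activations easy.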

L1, L2 regularization

Dropout
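Both treatments can be added as a one-line change to the model definition. A minimal sketch, assuming the small regression model from before (the penalty strength 0.01 and dropout rate 0.2 are illustrative values, not from the slides):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    # kernel_regularizer adds an L2 penalty (0.01 * sum of squared weights)
    # to the training loss, discouraging large weights
    layers.Dense(5, activation="relu", input_shape=(1,),
                 kernel_regularizer=regularizers.l2(0.01)),
    # Dropout randomly zeroes 20% of the previous layer's outputs
    # during training (it is inactive at prediction time)
    layers.Dropout(0.2),
    layers.Dense(1),
])
```

`regularizers.l1(...)` works the same way for an L1 penalty.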

Training

Stochastic gradient descent methods use shuffled mini-batches instead of the entire data set for each training iteration. We specify the batch size and how many epochs to train for.

An epoch is one full pass through the training set, i.e. the number of training iterations required to go through the entire training set once. For example, with 1,000 data points and a batch size of 10, one epoch takes 100 training iterations.
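The arithmetic is just the data-set size divided by the batch size:

```python
n_samples = 1000
batch_size = 10

# One epoch = enough mini-batches to cover the whole training set once
iterations_per_epoch = n_samples // batch_size
print(iterations_per_epoch)  # 100
```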

We can also specify validation data to see how the validation loss changes during training.

Experiment with batch size and number of epochs.
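Putting these pieces together, a hedged sketch of the training call (the toy data, layer sizes, and the Adam optimizer are assumptions; only `batch_size`, `epochs`, and `validation_data` are the parameters discussed above):

```python
import numpy as np
import tensorflow as tf

# Toy stand-in data: fit y = x^2 on random inputs
x = np.random.uniform(-1, 1, size=(200, 1)).astype("float32")
y = x ** 2

model = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(5, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# batch_size and epochs control the mini-batch loop;
# validation_data reports a validation loss after every epoch
history = model.fit(x, y, batch_size=10, epochs=2,
                    validation_data=(x[:50], y[:50]), verbose=0)
```

`history.history` then holds the per-epoch `loss` and `val_loss` curves, which is how you would watch for overfitting.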

Results

With good settings in the code (not the current settings), we can get the following fit:

Exercise 1

Classification

Iris versicolor, by Danielle Langlois (CC BY-SA 3.0), commons.wikimedia.org/w/index.php?curid=248095

Import data

Data label format: Usually given as 0, 1, or 2; we need it as one-hot vectors: [1,0,0], [0,1,0], or [0,0,1].
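One way to do this conversion (a NumPy sketch; Keras also provides `tf.keras.utils.to_categorical` for the same purpose):

```python
import numpy as np

labels = np.array([0, 2, 1, 0])   # integer class labels
one_hot = np.eye(3)[labels]       # each label indexes a row of the identity matrix
# label 0 -> [1, 0, 0], label 2 -> [0, 0, 1], label 1 -> [0, 1, 0]
```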

Build the model

Define the structure of the DNN. Here, we define three hidden layers, with 1000, 500, and 70 neurons in each respective layer.

Since this is classification, apply the softmax function to the last layer. This transforms the output to be a vector of probabilities that sum to one: \[\begin{aligned} p_i &= \frac{\exp(f_i)}{\sum\limits_j \exp(f_j)}\end{aligned}\] where \(p_i\) is probability of category \(i\) being true, \(f_i\) is \(i\)-th component of the final layer’s output.
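The softmax formula above translates directly into NumPy. This is a standalone sketch of the math (Keras applies it for you via `activation="softmax"`); subtracting the maximum is a standard numerical-stability trick that does not change the result:

```python
import numpy as np

def softmax(f):
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels in the ratio
    e = np.exp(f - np.max(f))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
# p is a probability vector: entries in (0, 1) that sum to one
```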

Loss

We again define the loss function and the optimizer. For classification, we use the cross entropy loss function. We are also interested in the accuracy metric (% correctly classified), in addition to the loss.

\[\begin{aligned} \mathrm{cross\_entropy} = -\frac{1}{n_\mathrm{samples}}\sum\limits_{j=1}^{n_\mathrm{samples}}\sum\limits_{i=1}^{n_\mathrm{classes}}\hat{p}_i^j\log(p_i^j)\end{aligned}\] where \(\hat{p}_i^j\) is the data (true label) and \(p_i^j\) is the prediction for class \(i\), sample \(j\). Note the minus sign, which makes the loss non-negative since \(\log(p_i^j) \le 0\).
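The formula, written out in NumPy (in Keras you would simply pass `loss="categorical_crossentropy"` and `metrics=["accuracy"]` to `model.compile`; this sketch just makes the arithmetic explicit):

```python
import numpy as np

def cross_entropy(p_hat, p):
    # p_hat: true labels (one-hot), p: predicted probabilities
    # Average over samples; the minus sign makes the loss non-negative
    n_samples = p_hat.shape[0]
    return -np.sum(p_hat * np.log(p)) / n_samples

p_hat = np.array([[1.0, 0.0], [0.0, 1.0]])   # true classes: 0, then 1
p     = np.array([[0.9, 0.1], [0.2, 0.8]])   # confident, correct predictions
loss = cross_entropy(p_hat, p)
```

Only the predicted probability of the *true* class contributes for each sample, so confident correct predictions give a small loss.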

Training

Training is done as before.

Exercise 2

Convolutional Neural Network (CNN)

Initialize model, Normalize input

We shift and normalize the inputs for better fitting.

We also define the input shape. The images are 28 by 28 pixels, with a grayscale value. This means each image is defined by a 3D tensor, \(28\times28\times1\) (a color image of the same size would be \(28\times28\times3\)).
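A sketch of this preprocessing step, assuming 8-bit grayscale images (pixel values 0 to 255; the random array stands in for the real data set):

```python
import numpy as np

# Stand-in for a batch of 10 grayscale images, 28 x 28 pixels
images = np.random.randint(0, 256, size=(10, 28, 28)).astype("float32")

# Scale pixel values to [0, 1] and add the channel axis,
# giving each image the 28 x 28 x 1 shape the network expects
x = (images / 255.0).reshape(-1, 28, 28, 1)
```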

Convolutional layer

The first convolutional layer is applied. This involves sweeping a filter across the image. (Gives "translational invariance.")

We use 4 filters with a size of \(5\times5\) pixels, with ReLU activation.

Max pooling

Max pooling looks at clusters of the output (in this example, \(2\times2\) clusters) and sets the maximum value within each cluster as the value for that cluster.

I.e. a “match” anywhere in the cluster \(\implies\) a “match” for the cluster.

Since we are also using stride of 2, the clusters don’t overlap.

Pooling reduces the size of the neural net, speeding up computations.
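The convolution and pooling layers described so far can be sketched in Keras as follows. The filter count, filter size, and pool size are from the slides; without padding, the \(5\times5\) convolution maps \(28\times28\) down to \(24\times24\), and \(2\times2\) pooling with stride 2 halves that to \(12\times12\):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # 4 filters of size 5x5 swept across the 28x28x1 image, ReLU activation
    tf.keras.layers.Conv2D(4, (5, 5), activation="relu",
                           input_shape=(28, 28, 1)),
    # 2x2 max pooling; stride defaults to the pool size, so clusters don't overlap
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
])
# Output: 12 x 12 spatial grid with 4 channels (one per filter)
```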

2nd convolution and pooling

A second convolutional layer, followed by max pooling, is used.

Fully-connected layer

The 3D tensor is converted back to a 1D tensor to act as input for a dense or fully-connected layer, the same type used with the previous regression and classification examples.

Dropout, Softmax

We add a dropout layer here. In this example, dropout happens at a rate of 40% (i.e. each unit’s output is temporarily set to zero with probability 40% at each training iteration).

As in the Iris classification problem, we finish with a dense layer and softmax activation function to return probabilities for each category.

Compile, Train

We compile and train as in the previous classification example:
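Assembling all of the pieces above, a hedged sketch of the full CNN pipeline. The first conv/pool block, the 40% dropout rate, and the softmax output follow the slides; the second block's filter count, the dense layer size, and the random stand-in data are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(4, (5, 5), activation="relu",
                           input_shape=(28, 28, 1)),       # 28x28 -> 24x24
    tf.keras.layers.MaxPooling2D((2, 2)),                  # -> 12x12
    tf.keras.layers.Conv2D(8, (5, 5), activation="relu"),  # 2nd conv (8 filters assumed) -> 8x8
    tf.keras.layers.MaxPooling2D((2, 2)),                  # -> 4x4
    tf.keras.layers.Flatten(),                             # 3D tensor -> 1D
    tf.keras.layers.Dense(32, activation="relu"),          # fully-connected (size assumed)
    tf.keras.layers.Dropout(0.4),                          # 40% dropout
    tf.keras.layers.Dense(10, activation="softmax"),       # class probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Random stand-in for the normalized images and one-hot labels
x = np.random.rand(20, 28, 28, 1).astype("float32")
y = np.eye(10)[np.random.randint(0, 10, 20)]
model.fit(x, y, batch_size=10, epochs=1, verbose=0)
```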

Exercise 3